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\\  Abstract 

This  paper  describes  a  parallelized  system,  ASSURE,  for  computing  the  reliabil¬ 
ity  of  embedded  avionics  flight  control  systems  which  are  able  to  reconfigure  them¬ 
selves  in  the  event  of  failure.  ASSURE  accepts  a  grammar  that  describes  a  reliability 
semi-Markov  state-space.  From  this  it  creates  a  parallel  program  that  simultaneously 
generates  and  analyzes  the  state-space,  placing  upper  and  lower  bounds  on  the  prob¬ 
ability  of  system  failure.  ASSURE  is  implemented  on  a  32-node  Intel  iPSC/860,  and 
has  achieved  high  processor  efficiencies  on  real  problems.  Through  a  combination  of 
improved  algorithms,  exploitation  of  parallelism,  and  use  of  an  advanced  micropro¬ 
cessor  architecture,  ASSURE  has  reduced  the  execution  time  on  substantial  problems 
by  a  factor  of  one  thousand  over  previous  workstation  implementations.  Furthermore, 
ASSURE’s  parallel  execution  rate  on  the  iPSC/860  is  an  order  of  magnitude  faster 
than  its  serial  execution  rate  on  a  Cray-2  supercomputer.  While  dynamic  load  balanc¬ 
ing  is  necessary  for  ASSURE’s  good  performance,  it  is  needed  only  infrequently;  the 
particular  method  of  load  balancing  used  does  not  substantially  affect  performance. 
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Introduction 


For  some  time  reliability  analyses  of  fault-tolerant  flight  control  systems  have  used  auto¬ 
mated  tools  such  as  ARIES,  SURF,  CARE  III,  ASSIST,  and  SURE  for  determining  system 
failure  probabilities [7]  (also  see  the  excellent  survey  in  [8]).  On  large  reliability  models  these 
programs  require  a  great  deal  of  computational  effort.  Typically  the  complexity  of  model 
analysis  grows  exponentially  in  the  size  of  the  model,  forcing  design  engineers  to  use  rel¬ 
atively  simple  reliability  models.  These  tools  simply  are  not  equal  to  the  challenge  posed 
by  the  analysis  of  highly  complex,  highly  reliable  control  systems  such  as  the  Integrated 
Airframe  Propulsion  System  Architecture  (IAPSA)  [3].  These  tools  stand  to  benefit  from 
parallel  processing,  if  parallelism  can  be  found  and  efficiently  exploited. 

This  paper  concerns  two  tools,  ASSIST  and  SURE,  that  are  used  to  place  upper  and 
lower  bounds  on  the  probability  of  failure  in  computer  systems  which  can  reconfigure  them¬ 
selves  in  the  event  of  failure.  We  describe  how  these  tools  were  rewritten  to  clearly  expose 
parallelism,  were  extended  to  permit  the  construction  of  highly  complex  reliability  mod¬ 
els,  and  were  subsequently  ported  to  a  distributed  memory  parallel  architecture,  the  Intel 
iPSC/860.  The  parallel  run-time  performance  on  32  nodes  of  the  iPSC/860  is  one  thousand 
times  faster  than  the  run-time  of  the  original  tools  on  a  Sun  3/150.  This  dramatic  gain  is 
brought  about  by  a  combination  of  algorithmic  improvements,  parallelism,  and  use  of  the 
advanced  (in  1990)  Intel  80860  [4]  microprocessor  architecture.  We  exploit  the  parallelism 
inherent  in  searching  state-spaces,  a  topic  of  active  research  interest  (e.g,  see  [6,  18,  16]). 
Our  contribution  is  to  show  how  to  transform  a  reliability  model  into  a  form  that  can  be 
efficiently  and  automatically  processed  on  a  parallel  architecture.  Our  experience  empirically 
proves  the  potential  of  parallel  processing  on  a  class  of  applications  which  hitherto  have  not 
exploited  parallelism.  Our  performance  comparisons  between  the  iPSC/860  and  Cray-2  also 
empirically  prove  the  clear  superiority  of  parallel  processing  on  scalar  applications  of  this 
type. 


Our  parallelized  tool  dynamically  balances  the  workload  whenever  some  processor  goes 
idle.  On  the  largest  problems  studied  dynamic  load  balancing  was  called  so  infrequently 
that  its  cost  does  not  detract  greatly  from  the  overall  performance.  However,  failure  to  use 
dynamic  load  balancing  can  significantly  degrade  performance. 

This  paper  is  organized  as  follows.  §2  provides  background  information  on  reliability 
modeling,  and  the  tools  of  interest.  §3  introduces  the  ASSIST  language,  §4  describes  the 
model  analysis  underlying  the  reliability  tools,  and  §5  describes  how  we  combined  and  par¬ 
allelized  two  existing  tools.  §6  reports  on  the  measured  performance  of  our  tool  on  a  suite  r 
of  network  reliability  problems.  §7  summaries  this  paper. 


2  Background 
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We  are  interested  in  numerically  estimating  the  reliability  of  complex  highly  reliable  com- _ 
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have  any  one  of  a  set  of  states;  typical  component  states  are  GOOD,  BAD,  INJQSE,  UN¬ 
USED.  Components  may  fail;  failures  may  trigger  recovery  processes  such  as  reconfiguring 
around  a  failed  component  or  replacing  it.  The  time-to-failure  distributions  for  individual 
components  are  taken  to  be  exponential;  the  time-for-recovery  distributions  are  permitted 
to  be  general.  Formally,  the  reliability  model  is  a  semi-Markov  process  [17]. 

A  semi-Markov  state-space  can  be  thought  of  as  a  directed  graph,  each  node  of  which  is  a 
system  state.  A  system  state  is  typically  described  as  a  vector  of  integer-valued  component 
state  values;  the  system  state  changes  when  some  component  state  changes,  for  example,  if 
a  processor  or  communication  link  fails.  The  amount  of  time  the  system  spends  in  a  state 
is  random,  and  is  known  as  the  holding  time.  Arcs  out  of  a  state  describe  transitions  that 
are  possible  from  that  state.  The  behavior  of  the  modeled  system  over  a  time  interval  [0,  T] 
is  therefore  described  as  a  path  through  the  state-space,  with  the  sum  of  holding  times  of 
states  on  the  path  being  at  least  T.  Given  any  path  through  the  state-space,  there  is  a 
probability  that  the  system’s  behavior  in  [0,  T]  is  described  precisely  by  that  path. 

The  system  is  considered  to  have  failed  when  certain  problem-dependent  criteria  are  met. 
For  example,  a  critical  component  may  have  a  spare,  but  after  both  the  component  and  its 
spare  have  failed  the  system  may  be  unable  to  function.  A  system  state  reflecting  some 
failure  condition  is  known  as  a  death-state.  The  tools  discussed  here  estimate  the  transient 
probability  that  the  system  enters  any  death-state  within  a  mission  time  T. 

We  have  implemented  a  parallelized  reliability  analysis  tool  based  on  mathematics  dis¬ 
covered  by  White  [19],  and  two  existing  tools  developed  by  Butler  and  Johnson,  all  at  the 
NASA  Langley  Research  Center.  SURE  [2]  was  developed  first.  It  accepts  a  fully  expanded 
semi-Markov  state-space  which  describes  the  reliability  structure  of  a  given  problem,  and 
determines  upper  and  lower  bounds  on  the  probability  of  reaching  any  model  death-state 
within  the  mission  time  T.  Each  state  in  the  SURE  input  file  is  identified  by  a  unique  integer 
(not  vector)  value.  It  was  quickly  recognized  that  SURE  state-spaces  are  tedious  to  build 
by  hand,  especially  given  their  size  and  the  non-intuitive  nature  of  state  identifiers.  ASSIST 
[9]  was  developed  to  automatically  generate  a  SURE  state-space  from  a  compact  and  intu¬ 
itive  description  of  the  state  space.  The  reliability  modeler  describes  a  system  state  as  a 
vector  of  component  states,  and  uses  a  simple  language  to  describe  death-state  and  search- 
pruning  conditions,  to  describe  conditions  under  which  a  particular  type  of  state  transition 
may  occur,  and  how  a  state  vector  changes  in  response  following  a  particular  transition. 
ASSIST  accepts  the  model  description,  and  then  generates  a  file  containing  the  entire  SURE 
state-space.  A  relatively  small  ASSIST  model  can  describe  a  very  large  SURE  state-space. 

Both  ASSIST  and  SURE  offer  many  features,  perform  a  great  deal  of  error  checking,  and 
present  polished  interfaces  to  their  users.  These  tools  are  in  use  at  approximately  forty-five 
industrial  and  government  sites. 

Reliability  models  with  large  state-spaces  tax  both  ASSIST  and  SURE.  One  problem  is 
simply  that  of  state-space  size — models  have  been  known  to  completely  exhaust  a  50  MB 
disk  partition  allocated  to  virtual  memory.  The  other  problem  is  computational — the  model 
analysis  requires  a  traversal  of  every  path  from  the  starting  state  to  any  death-state.  A 
typical  large  model  may  have  states  numbering  in  the  thousands  to  tens  of  thousands  and 
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have  an  order  of  magnitude  more  transitions.  Large  models  create  a  combinatorial  explosion 
in  the  amount  of  work  that  must  be  done. 

We  have  developed  a  tool,  ASSURE,  that  dramatically  reduces  the  time  required  to 
analyze  ASSIST  models.  Unlike  SURE,  ASSURE  analyzes  system  states  that  are  explic¬ 
itly  represented  as  vectors  of  state  components.  Like  ASSIST,  ASSURE  uses  the  ASSIST 
language  rules  to  determine  whether  a  given  state  is  a  death-state,  and  to  compute  the  tran¬ 
sitions  out  of  the  state.  The  most  important  innovation  of  ASSURE  over  ASSIST— »SURE 
is  that  ASSURE  simultaneously  generates  and  analyzes  paths  through  the  state-space.  AS¬ 
SURED  space  requirements  are  dramatically  smaller,  because  ASSURE  does  not  maintain 
the  entire  state-space  in  memory.  ASSURE  processes  a  system-state  by  analyzing  it,  after 
which  it  generates  the  state’s  descendents.  The  memory  used  used  to  represent  that  system 
state  (and  the  path  to  it)  are  then  discarded.  A  second  innovation  involves  our  internal  rep¬ 
resentation  of  a  system  state,  and  a  path  to  it.  The  innovation  is  made  possible  by  the  model 
analysis  mathematics.  The  heart  of  SURE  (and  thus  ASSURE)  is  a  theorem  that  places  up¬ 
per  and  lower  bounds  on  the  probability  of  the  system  traversing  a  given  path  through  the 
state-space  within  a  given  amount  of  time  [19].  When  the  mission  time  T  is  small  relative 
to  the  mean  component  lifetime,  the  formulae  comprising  these  bounds  involve  sums  and 
products  of  characteristics  of  states  along  the  path,  implying  that  one  can  accumulate  these 
characteristics  during  the  search,  instead  of  saving  each  individual  characteristic.  ASSURE 
is  centered  around  a  “path-record”  data  structure  that  contains  a  system  state  vector  and 
all  the  accumulated  characteristics  of  some  path  from  the  initial  state  to  that  system  state. 
The  memory  required  to  store  a  path-record  does  not  change  as  paths  are  extended  through 
transitions.  ASSURE  iteratively  removes  a  path-record  from  a  work-list,  determines  whether 
the  associated  state  is  a  death-state,  should  be  pruned,  or  generates  transitions.  Discovery 
of  a  death-state  prompts  calculation  of  the  upper  and  lower  bounds  of  reaching  that  state 
by  the  path  whose  accumulated  characteristics  are  recorded  in  the  path  record.  Generated 
transitions  result  in  new  path-records  that  are  attached  to  the  front  of  a  work  list1.  The 
memory  space  for  the  analyzed  path-record  is  reclaimed,  and  the  process  repeats. 

Our  use  of  path-records  was  largely  motivated  by  our  anticipated  need  for  dynamic  load¬ 
balancing.  A  path-record  may  be  processed  on  any  processor;  at  any  time  a  processor 
may  share  its  load  by  removing  a  number  of  path-records  from  its  work-list,  and  sending 
them  to  another  processor.  The  recipient  simply  inserts  them  into  its  own  work  list.  The 
design  decision  paid  off.  Dynamic  load-balancing  schemes  were  straightforward  to  implement, 
given  the  simplicity  of  the  work-list  and  path-record  data  structures.  However,  this  design 
decision  constrains  the  utility  of  ASSURE  to  models  where  the  mission  time  is  small  relative 
to  the  mean  component  lifetime.  SURE  uses  different  mathematics  to  compute  bounds 
for  large  mission  times.  This  mathematics  requires  that  all  characteristics  of  the  given 
path  be  individually  known,  not  simply  accumulated.  Consequently,  the  longer  a  path 
becomes,  the  more  memory  is  required  to  save  its  salient  characteristics.  ASSURE  could 
be  modified  to  incorporate  long  mission  time  analysis;  at  the  time  ASSURE  was  designed 

:By  placing  newly  generated  path-records  at  the  front  of  the  work-list  ASSURE  is  employing  a  depth-first 
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we  sought  to  simply  demonstrate  that  many  (not  necessarily  all)  of  the  problems  solved  by 
ASSIST— »SURE  could  be  effectively  parallelized. 


3  The  ASSIST  Language 

The  most  fundamental  idea  we  use  in  ASSURE  was  first  exploited  by  ASSIST:  semi-Markov 
processes  of  interest  can  be  parametrically  described  by  a  simple  grammar.  To  show  the 
power  of  this  idea,  we  now  sketch  the  main  features  of  the  ASSIST  language. 

Each  state  in  a  semi-Markov  reliability  model  describes  a  possible  system  state  in  terms 
of  factors  affecting  reliability.  These  states  are  typically  expressible  as  vectors  of  integers, 
e.g.,  the  number  of  working  processors,  whether  a  communication  bus  has  failed  (0/1),  the 
number  of  spare  processors.  The  system  makes  a  transition  from  a  given  state  when  the 
value  of  some  state  vector  component  changes  due  to  an  additional  failure  or  repair. 

ASSIST  exploits  the  fact  that  very  many  transitions  can  be  described  parametrically. 
For  example,  imagine  that  a  state  has  n  working  processors,  each  of  which  has  a  constant 
failure  rate  A.  The  rate  at  which  the  system  makes  a  transition  due  to  processor  failure 
is  nA.  This  single  description  characterizes  a  particular  type  of  transition  for  all  values  of 
n.  ASSIST  recognizes  parametric  transitions  using  the  TRANTO  statement.  A  TRANTO 
statement  consists  of  a  Boolean  conditional  which  indicates  when  the  transition  is  permitted 
to  occur,  a  destination  expression  which  describes  how  the  state  is  to  be  modified  following 
the  transition,  and  a  rate  statement  which  specifies  the  transition  rate  due  to  the  specified 
transition.  For  example,  a  state  may  consist  of  a  vector  <NP,NS,F>  where  NP  denotes  the 
current  number  of  working  processors,  NS  is  the  current  number  of  spare  processors,  and  F 
is  a  flag  indicating  whether  a  failed  processor  is  being  replaced  in  this  state.  Consider  the 
following  TRANTO  statement: 

IF  F=0  TRANTO  NP=NP-1,F=1  BY  NP* LAMBDA 

This  type  of  transition  may  only  occur  if  no  previous  processor  failure  is  still  being  repaired, 
i.e.,  when  F  =  0.  In  making  the  transition,  the  NP  component  of  the  state  is  decremented  to 
reflect  the  failed  processor,  and  the  F  component  is  set  to  1  to  indicate  that  a  failed  processor 
is  being  replaced.  The  cumulative  rate  at  which  this  transition  occurs  is  NP*LAMBDA,  where 
LAMBDA  is  the  processor  failure  rate.  A  preamble  in  the  ASSIST  file  declares  that  variables 
NP  and  F  are  components  of  the  system  state  vector,  gives  them  initial  values,  and  quantifies 
LAMBDA. 

A  failed  processor  may  be  replaced  by  a  spare.  The  TRANTO  statement 

IF  F=1  AND  NS>0  TRANTO  NP=NP+1,  F=0,  NS=NS-1  BY  <REPMEAN,REPSD> 

describes  this  transition.  Here  REPMEAN  is  the  mean  time  of  this  transition,  and  REPSD  is 
its  standard  deviation.  This  syntax  flags  the  transition  as  being  associated  with  a  recovery 
process,  and  gives  the  information  that  SURE  will  need  when  considering  this  transition. 

Finally,  it  may  happen  that  a  processor  will  fail  while  another  is  being  replaced.  This  is 
a  condition  that  causes  the  system  to  fail,  and  can  be  flagged  by  setting  the  F  variable  to  2. 
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IF  F=1  TRANTO  F=2  BY  NP+LAMBDA 


Ultimately  we  are  interested  in  the  states  where  the  system  is  considered  to  have  failed. 
The  DEATHIF  statement  describes  such  conditions.  As  we  have  mentioned,  the  system  is 
considered  to  have  failed  if  a  co-incident  fault  occurs.  This  is  indicated  with  the  statement 

DEATHIF  F=2 

Another  type  of  failure  occurs  if  a  processor  failed  but  there  are  no  spares.  This  condition 
is  indicated  with  the  statement 

DEATHIF  F=1  AND  NS=0 

ASSIST  files  may  also  specify  that  a  search  be  “pruned”,  meaning  that  a  path  is  not 
extended.  Pruning  is  quite  important  for  reducing  the  complexity  of  a  search,  allowing  the 
search  to  concentrate  on  paths  having  the  highest  probability  of  traversal.  The  PRUNEIF 
statement  identifies  conditions  for  pruning  a  search  in  the  same  way  that  the  DEATHIF 
statement  identifies  a  death-state.  One  can  also  prune  by  probability — prune  if  the  upper 
bound  becomes  sufficiently  small.  A  number  of  other  ASSIST  features  are  largely  self- 
explanatory.  Figure  1  illustrates  a  complete,  more  complex  ASSIST  model.  The  system 
being  modeled  is  composed  of  three  fault-tolerant  subsystems;  the  system  is  up  if  and  only  if 
all  three  subsystems  are  up.  Each  fault-tolerant  subsystem  is  composed  of  three  components, 
each  having  a  distinct  failure  rate.  A  subsystem  is  up  if  any  two  of  its  components  are  up. 

The  ASSIST  syntax  is  intentionally  designed  to  aid  the  automated  generation  of  state 
transitions.  Given  a  state  vector  we  can  easily  determine  (i)  whether  the  state  is  a  death- 
state  (and  hence  has  no  transitions),  (ii)  whether  the  state  meets  any  pruning  criteria  (again, 
no  transitions),  or  (iii)  the  transitions  permitted  from  the  state.  For  example,  consider  the 
processing  of  the  initial  state  in  Figure  l’s  example.  None  of  the  death-state  conditions  are 
satisfied,  because  all  C,S  and  V  components  have  value  1.  The  single  pruning  condition  is 
not  met,  as  NCF  =  0.  However,  a  number  of  transitions  may  occur.  Any  one  of  the  C,S  or  V 
components  may  fail,  leading  to  a  state  vector  which  is  identical  to  the  initial  state,  except 
that  the  status  bit  for  the  failed  component  is  now  0,  and  the  NCF  component  is  1.  There 
are  nine  such  states  reachable  from  the  initial  state.  Each  of  these  may  reach  eight  other 
states,  and  so  on. 

In  the  course  of  developing  ASSURE  it  became  clear  that  the  sparsity  of  ASSIST’s 
syntax  made  it  difficult  to  design  reliability  models  of  computer  networks  that  incorporate 
sophisticated  recovery  algorithms.  For  example,  consider  a  network  that  statically  routes 
messages,  i.e.,  the  same  path  is  used  for  every  message  between  a  given  sender  and  receiver 
pair.  Suppose  that  the  network  has  many  redundant  routing  nodes  and  communication 
links.  The  network  is  temporarily  disabled  whenever  a  routing  node  or  link  on  a  static 
route  fails.  However,  the  network  can  be  made  operational  if  a  new  route  using  different 
nodes  and/or  links  can  be  found  to  replace  the  failed  one.  Therefore,  to  determine  whether 
the  network  can  be  made  functional  following  a  link  or  node  failure,  we  must  essentially 
determine  the  post-failure  network  connectivity.  Virtually  the  only  way  of  using  ASSIST 
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LAMBDA.S  =  7.62E-4; 
TIME  -  10.0; 

LAMBDA. V  =  3.9E-4; 
LAMBDA.C  =  3.5E-4; 


SPACE=(C:  ARRAYC1..3]  OF  0..1, 

S:  ARRAYC1..3]  OF  0..1, 

V:  ARRAY [1. .3]  OF  0..1, 

NCF) ; 

START  =  (1,1, 1,1, 1,1, 1,1, 1,0); 

DEATHIF  (C[l]+S[l]+V[l]  <2  )  OR 
(C[2]+S[2]  +V  [2]  <  2)  OR 
(C [3] +S [3] +V[3]  <  2); 

PRUNEIF  NCF  >=  5; 

FOR  I *=1,3; 

IF  C[IJ  >0  TRANTO  C[I]  =0,  NCF  =  NCF+1  BY  LAMBDA.C; 

IF  SCI]  >0  TRANTO  S[I]  =0,  NCF  =  NCF+1  BY  LAMBDA.S; 

IF  V[I]  >  0  TRANTO  V[I]  =0,  NCF  =  NCF+1  BY  LAMBDA.V; 

ENDFOR; 


Figure  1:  An  Example  ASSIST  Model 


to  test  connectivity  (e.g.  in  a  DEATHIF  statement)  is  to  exhaustively  enumerate  all  of  the 
state  conditions  under  which  the  network  is  disconnected.  This  is  clearly  unsatisfactory  for 
all  but  the  smallest  networks.  To  deal  with  this  problem  we  extended  the  ASSIST  syntax. 

We  will  later  see  that  ASSURE  translates  ASSIST  models  into  C  functions.  Given  this 
approach  it  was  natural  to  extend  ASSIST  by  permitting  explicit  reference  to  C  functions 
in  DEATHIF  and  TRANTO  statements.  These  functions  are  permitted  to  read  (and  in  the 
case  of  TRANTO  post-conditionals,  write)  ASSIST  system  state  variables  as  though  they 
were  ordinary  C  integers  or  arrays  of  integers.  Instead  of  constructing  Boolean  conditionals 
to  identify  death  or  transition  conditions,  we  permit  a  call  to  a  user-written  C  function  that 
computes  and  returns  a  0/1  value.  In  the  network  example  above,  we  might  write  a  C  func¬ 
tion  Disconnected ()  that  determines  the  network  connectivity  as  a  function  of  the  current 
system  state.  Extended  ASSIST  thejysermits  the  statement  DEATHIF  DisConnectedO.  It 
is  permissible  to  include  function  arguments  in  these  C  functions.  Extended  ASSIST  also 
permits  the  user  to  define  and  initialize  read-only  data  structures  that  may  be  referenced 
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from  within  the  C  functions.  For  example,  this  feature  is  useful  for  computing  and  storing 
the  network  topology,  which  otherwise  would  have  to  be  expressed  in  ASSIST  in  terms  of 
constants  or  one-dimensional  arrays  of  constants. 


4  Model  Analysis 

The  mathematics  underlying  SURE  permit  one  to  place  upper  and  lower  bounds  on  the 
probability  of  the  system  traversing  a  given  path  within  the  mission  time,  T.  These  bounds 
depend  on  a  classification  of  each  transition  on  the  path  into  one  of  three  classes.  These 
classes  are  explained  below,  along  with  definitions  needed  to  express  the  bounds. 

Class  1:  Class  1  is  composed  of  failure  transitions  that  occur  in  states  from  which  there 
are  only  failure  transitions.  Characteristics  of  this  transition  are  its  rate,  and  the  sum 
of  the  rates  of  all  other  transitions  from  this  state. 

Class  2:  A  Class  2  transition  is  a  recovery  transition.  Characteristics  of  this  transition 
are  the  sum  of  all  failure  transitions  from  this  state,  the  probability  that  this  recovery 
succeeds  over  every  other  recovery  transition  from  this  state,  and  the  mean  and  variance 
of  this  transition  given  that  it  is  taken. 

Class  3:  A  Class  3  transition  is  a  failure  transition  from  a  state  that  also  contains  a 
recovery  transition.  Characteristics  of  this  transition  are  its  rate,  the  sum  of  rates 
of  other  failure  transitions  from  this  state,  the  probability  that  the  specified  recovery 
transition  succeeds  over  all  other  recovery  transitions  from  this  state,  the  mean  and 
variance  of  the  recovery  transition  given  that  it  is  taken. 

A  technically  complete  description  of  these  transition  characteristics  can  be  found  elsewhere 
[2],  and  is  not  needed  to  describe  ASSURE  processing  at  a  high  level2. 

A  statement  of  the  SURE  theorem  in  terms  that  suit  our  purposes  is  given  below. 

Theorem  1  (White)  Let  p  be  a  path  composed  of  k  Class  1  transitions,  m  Class  2  tran¬ 
sitions,  and  n  Class  S  transitions.  The  probability  D(p,  T)  of  taking  a  path  p  through  the 
state-space  within  the  mission  time  T  can  be  bounded  as  follows 

L(p,T)<D(p,T)<U(p,T). 

L{p,T)  and  U(p,T )  have  the  form 

k  m  n 

l(p,  n = n  <#'(?. T)  n  cJffe  t)  n  n 

i=l  j=l  j=l 

2  A  precise  definition  of  these  characteristics  and  the  SURE  theorem  requires  several  pages  of  mathematics 
whose  details  are  not  essential  for  understanding  ASSURE. 
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k  m  n 

v(p,t) = n  c$\p,  t)  n  <’(p,  t )  n  c$\P,  -n 

i= i  i=i  i=i 

where  each  C\f\p,T)  and  C^\p,  T)  is  a  function  only  of  the  number  of  Class  i  transitions, 
T,  and  the  characteristics  of  the  jth  Class  i  transition  on  p. 

□ 

This  nice  structure  of  U(p,T)  and  L(p,T)  can  be  exploited  for  parallel  processing.  In  the 
course  of  expanding  p  we  do  not  need  to  store  each  individual  transition’s  characteristics. 
We  can  “accumulate”  them  as  appropriate,  thereby  saving  space.  For  example,  one  of  the 
product  functions  comprising  U(p,T )  is 

AiA2...AfcTfc 

k\ 

where  k  is  the  number  of  Class  1  transitions  in  p,  and  the  A^’s  are  their  individual  rates. 
When  a  Class  1  transition  with  rate  A  is  taken  one  need  only  increment  a  Class  1  transition 
counter,  and  multiply  a  rate  product  accumulator  by  A.  Once  a  death-state  or  pruning- 
condition  terminates  the  expansion  of  p ,  then  U(p,T)  and  L(p,T)  can  be  computed  from 
the  accumulated  characteristic  data,  and  the  upper  and  lower  bounds  on  the  system  failure 
probability  can  be  adjusted. 

SURE  accepts  a  description  of  a  directed  graph  representing  a  state-space.  From  the 
initial  node  S  it  initiates  a  depth-first-search  of  the  graph.  Whenever  a  node  L  without 
descendants  is  encountered  the  SURE  theorem  is  applied  to  the  specific  path  from  S  to  L. 
Death-state  nodes  accumulate  the  upper  and  lower  bounds  of  paths  that  reach  them.  Once 
an  exhaustive  all-paths  traversal  has  been  performed  one  determines  the  overall  upper  and 
lower  bound  by  summing  the  bounds  associated  with  each  death-state  node.  One  can  also 
have  the  bounds  for  individually  specified  death-states  printed. 

State  space  graphs  may  be  cyclic,  so  that  an  all-paths  traversal  of  the  graph  will  never 
end.  SURE  handles  the  problem  by  always  checking  to  see  if  the  “next”  node  in  a  path 
is  already  on  the  path,  forming  a  cycle.  Cycles  are  “unrolled”  for  a  user  specified  number 
of  iterations,  and  the  path  is  carefully  truncated  using  equations  developed  in  [2].  Another 
form  of  pruning  is  to  simply  terminate  a  path  when  the  upper  bound  on  the  probability 
of  taking  the  path  falls  below  a  user-specified  pruning  threshold.  The  overall  upper  bound 
on  the  failure  probability  is  incremented  by  the  path’s  upper  bound,  reflecting  the  worst 
case  scenario  where  every  node  immediately  reachable  from  the  pruned  path  represents  a 
death-state.  ASSURE  uses  only  the  later  method.  It  has  been  shown  that  pruning  of  this 
type  is  sufficient  to  terminate  loops3. 

5  ASSURE 

ASSURE  combines  the  respective  functions  of  ASSIST  and  SURE  into  one  integrated  tool. 
At  the  same  time  as  ASSURE  identifies  transitions  out  of  a  system  state  (as  does  ASSIST), 

3Private  communication  from  Ricky  Butler  and  Alan  White. 
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Figure  2:  Creating  an  ASSURE  program 


it  analyzes  those  same  transitions  (as  does  SURE).  Not  only  are  we  then  able  to  analyze 
the  problem  in  a  single  “pass”,  we  achieve  substantial  savings  in  memory  space  as  well, 
permitting  ASSURE  to  analyze  problems  that  defeat  ASSIST— >SURE.  We  will  later  see 
that  excellent  parallel  run-time  efficiencies  are  also  achieved. 

We  used  the  UNIX  tools  Lex  and  Yacc  [10]  to  describe  the  ASSIST  grammar,  and  build  an 
ASSIST  language  parser  which  transforms  ASSIST  models  into  C  functions.  As  a  first  step, 
one  builds  the  ASSIST  parser,  and  compiles  problem  independent  control  code.  This  step 
happens  only  once.  Following  it,  every  ASSIST  model  undergoes  two  processing  phases.  In 
the  pre-processing  phase  an  ASSIST  model  is  parsed,  transformed  into  C  functions  which  are 
then  compiled  and  linked  with  the  pre-compiled  control  routines.  The  pre-processing  phase 
is  followed  by  an  execution  phase  where  the  problem  is  solved.  These  steps  are  illustrated  in 
Figure  2.  The  pre-processing  and  execution  phases  will  next  be  described  in  more  detail. 

5.1  Pre-processing  Phase 

The  pre-processing  phase  is  concerned  with  the  translation  of  an  ASSIST  model  into  C  rou¬ 
tines  that  recognize  death-states,  pruning  conditions,  and  generate  transition  states.  The 
ASSIST  language  syntax  is  closely  related  to  the  syntax  of  imperative  programming  lan¬ 
guages  like  Pascal  and  C.  It  is  a  conceptually  simple  matter  to  transform  any  Boolean 
expression  of  state  variables  expressed  in  ASSIST  into  a  corresponding  expression  in  C. 


int  CheckDeath(ptr) 

struct  PathRecord  *ptr;  /*  pointer  to  a  path-record  */ 

if ( (ptr->State [0] +ptr->State [3] +ptr->State [6] <2)  1 1 
(ptr->State [1] +ptr->State [4] +ptr->State [7] <2)  1 1 
(ptr->State [2] +ptr->State [5] +ptr->State [8] <2) ) 

-[  AnalyzeDeathState(ptr) ; 
retumCl) ; 

> 

return (0) ; 

> 


Figure  3:  Translated  C  routine  to  recognize  death-states 


Therefore,  if  we  can  map  system  state  variables  onto  C  language  variables  we  can  recognize 
death-state,  pruning,  and  transition  conditions.  It  is  also  straightforward  to  translate  AS¬ 
SISTS  looping  constructs  and  the  state  modification  statements  following  a  TRANTO  into 
C  language  statements. 

During  the  pre-processing  stage  an  ASSIST  file  is  parsed.  References  to  state  variables 
are  translated  into  references  to  particular  offsets  within  a  system  state  array  contained 
in  a  path-record.  Problem  specific  C  functions  are  generated  for  recognizing  death-states, 
recognizing  pruning  conditions,  and  generating  transitions.  ASSURE  control  code  passes 
a  pointer  to  the  path-record  of  interest  to  these  routines.  The  routines  use  the  pointer  to 
access  individual  state  variables.  Consider  the  problem  expressed  in  Figure  1.  The  system 
state  is  stored  in  the  ten  element  array  State  contained  in  the  path-record.  Arrays  C,S,  and 
V  are  packed  into  State  beginning  at  locations  0,3,  and  6  respectively.  NCF  occupies  location 
9.  Let  ptr  be  a  pointer  to  a  path  record,  and  AnalyzeDeathStateO  be  a  routine  called 
when  a  death-state  is  recognized.  The  ASSURE  pre-processing  stage  will  produce  an  integer 
function  CheckDeath(ptr)  (shown  in  Figure  3)  that  can  be  called  to  determine  if  the  state 
pointed  to  by  ptr  is  a  death-state4.  The  reader  unfamiliar  with  C  can  interpret  this  code  by 
keeping  in  mind  that  “struct”  variables  are  basically  records,  that  record  fields  are  accessed 
through  pointers  with  the  ->  symbol,  and  that  1 1  denotes  a  logical  OR. 

Pruning  conditions  are  checked  in  entirely  the  pame  manner. 

If  a  path-record  survives  the  death  and  pruning  tests  it  is  passed  to  a  routine  TranTo  (ptr) 
that  generates  a  list  of  path-records  corresponding  to  the  transition  states  reachable  from 
the  state  pointed  to  by  ptr.  TranTo  (ptr)  tests  each  TRANTO  condition;  whenever  one  is 
satisfied  a  copy  of  ptr’s  path  record  is  made,  selected  components  are  modified  as  described 
in  the  ASSIST  model;  the  modified  path-record  is  then  attached  to  the  list  of  transition  path- 

4The  code  shown  in  Figures  3  and  4  is  equivalent  to  what  is  actually  produced,  but  is  far  more  readable. 
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struct  PathRecordList  *TranTo(ptr) 
struct  PathRecord  *ptr; 

{ 

struct  PathRecordList  *TList; 

struct  PathRecord  *TranPtr,*CopyPathRecord() ; 

int  I; 

TList  =  NULL; 
for (1=1;  I<=3;  I++) 

{ 

if (ptr->State[0+I-l]  >  0) 

{  TranPtr  =  CopyPathRecord(ptr) ;  /*  make  a  copy  */ 

TranPtr~>rate  =  LAMBDA.C;  /*  save  transition  rate  */ 

TranPtr->State[0+I-l]  =  0; 

TranPtr~>State [9]  =  ptr->State[9]-l; 

Attach(TranPtr, TList) ;  /*  attach  to  tranto  list  */ 

> 

if (ptr->State [3+1-1]  >  0) 

{  TranPtr  *  CopyPathRecord(ptr) ;  /*  make  a  copy  */ 

TranPtr->rate  ■  LAMBDA_S ;  /*  save  transition  rate  */ 

TranPtr->State [3+1-1]  ■  0; 

TranPtr->State [9]  ■  ptr->State[9]-l; 

Attach (TranPtr, TList) ;  /*  attach  to  tranto  list  */ 

> 

if (ptr->State [6+1-1]  >  0) 

{  TranPtr  =  CopyPathRecord(ptr) ;  /*  make  a  copy  */ 

TranPtr->rate  *  LAMBDA_V;  /*  save  transition  rate  */ 

TranPtr->State [6+1- 1]  =  0; 

TranPtr->State [9]  =  ptr->State[9]-l; 

Att ach(TranPtr, TList ) ;  /*  attach  to  tranto  list  */ 

} 

> 

return(TList) ; 

> 


Figure  4:  Translated  C  routine  to  generate  transitions 


records.  The  procedure  generated  for  our  example  problem  is  given  in  Figure  4.  Function 
CopyPathRecord(ptr)  allocates  a  block  of  dynamic  memory  for  a  new  path-record,  copies 
the  contents  of  the  path- record  pointed  to  by  ptr,  and  returns  a  pointer  to  the  newly  allocated 
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block.  Function  Attach(TranPtr,TList)  links  the  newly  modified  path- record  onto  a  list  of 
transition  states  generated  by  the  state  pointed  to  by  ptr.  The  expressions  used  to  compute 
indices  into  the  State  vector  look  curious  at  first  glance.  They  are  the  product  of  a  simple 
translation  process  that  transforms  an  ASSIST  index  into  a  State  vector  index  by  adding 
an  offset  associated  with  the  state  variable,  and  subtracting  one  (because  C  arrays  begin 
with  index  0).  Any  ordinary  compiler  will  combine  the  constants  expressed  in  the  index. 

It  would  be  possible  to  write  problem  independent  routines  that  interpret  an  ASSIST 
model,  but  it  seemed  to  us  that  the  run-time  overhead  of  continually  interpreting  text  would 
soon  exceed  the  pre-processing  cost  of  parsing  the  ASSIST  file,  and  compiling  the  resulting 
routines.  On  the  realistic  problems  we  study,  the  pre-processing  delay  is  small. 


5.2  Execution  Phase 

Next  we  discuss  ASSURB’s  execution  phase.  First  we  describe  the  serial  algorithm,  and  then 
how  it  is  parallelized. 

The  execution  phase  begins  once  the  ASSIST  file  has  been  translated,  compiled,  and 
linked  with  problem  independent  code.  Like  SURE,  ASSURE  explores  the  state  space  using 
a  depth-first  traversal.  To  start  the  processing,  a  path-record  describing  the  starting  state  is 
placed  in  a  list  of  path-record'j,  WorkList.  Processing  then  consists  of  iteratively  selecting  a 
path-record  from  the  front  of  WorkList,  testing  the  system  state  S  it  contains  for  death-state 
and  pruning  conditions,  determining  all  of  the  states  reachable  from  S  and  placing  path- 
records  for  them  at  the  front  of  WorkList.  Processing  is  complete  when  WorkList  is  empty. 
The  serial  form  of  this  algorithm  is  shown  in  Figure  5.  The  code  is  intended  to  be  largely 
self-explanatory;  functions  of  type  void  return  no  values,  and  the  “!”  operator  applied  to  an 
integer  is  a  Boolean  1  if  the  integer  is  zero,  and  is  a  Boolean  0  otherwise. 

ASSURE  was  designed  to  permit  a  straightforward  parallelization  of  the  execution  phase 
on  a  distributed  memory  multiprocessor.  We  execute  the  algorithm  above  on  one  processor 
until  the  WorkList  has  at  least  as  many  path- records  as  there  are  processors.  The  path- 
records  are  then  partitioned  evenly  among  all  processors,  and  placed  into  the  individual 
WorkLists.  Each  processor  then  executes  the  serial  algorithm.  Whenever  a  path-record 
reveals  a  death  or  pruned  state  the  path’s  probability  bounds  are  computed  and  accumulated 
in  variables  that  are  local  to  the  processor.  The  fact  that  a  single  processor’s  WorkList  goes 
empty  does  not  imply  that  the  computation  is  completed  (so  that  the  PrintBoundsC)  routine 
is  not  immediately  called).  An  idle  processor  may  reseed  its  WorkList  and  continue  by  asking 
for  and  receiving  any  path-record  from  a  processor  that  has  an  overabundance  of  them. 
The  issue  of  load-balancing  is  one  we  will  later  discuss  in  more  detail.  The  computation 
is  complete  when  the  WorkList  in  every  processor  is  empty.  At  this  point  one  needs  to 
accumulate  the  upper  bounds  computed  in  all  processors,  and  likewise  accumulate  the  lower 
bounds.  These  aggregate  sums  yield  the  overall  upper  and  lower  bounds  on  the  probability 
of  entering  a  death-state  within  the  mission  time. 

Key  to  the  approach  is  the  fact  that  any  path-record  may  analyzed  on  any  processor. 
One  should  appreciate  the  fact  that  this  need  not  have  been  the  case,  and  is  in  fact  a 
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Anal yzePr obi em ( ) 

-C 

/*  Declaration  local  variables  and  forward  function  references  */ 
struct  PathRecordList  *WorkList,*descendents; 
struct  PathRecord  *path,*Dequeue() ; 
void  InitializeO ,  ExtendPathO ,  Enqueue(), 

ReleaseC),  PrintBoundsO ; 
int  EmptyO  ,  CheckDeathQ  ,  PrunePathO  ; 

/*  Start  working  */ 

Initialize(WorkList) ; 
while ( ! Empty (WorkList) ) 

{  path  «  Dequeue (WorkList) ; 
if(  ICheckDeath(path)  && 

IPrunePath(path)  ) 

{  descendents  =  TranTo(path) ; 

ExtendPath(descendents) ; 

Enqueue (descendents, WorkList) ; 

> 

Release (path) ; 

> 

PrintBoundsO ; 

> 


/*  load  starting  state  */ 

/*  while  work  to  do  */ 

/*  get  first  path-record  */ 

/♦if  path  survives  */ 

/*  generate  descendents  */ 

/*  classify  transitions, 
accumulate  new  path 
characteristics  ♦/ 

/*  stick  new  path-records 
at  front  of  WorkList  */ 

/♦  recover  memory  space  */ 

/*  report  the  reliability  */ 


Figure  5:  ASSURE  Algorithm 


design  decision  with  significant  consequences.  For  example,  suppose  one  needed  to  know  the 
upper  and  lower  bounds  on  reaching  each  individual  death-state.  This  sort  of  information 
is  easily  provided  by  ASSIST— »SURE,  but  cannot  be  provided  by  ASSURE,  at  least  in 
its  present  form.  Since  the  state-space  is  not  bound  to  processors,  the  contributions  to 
any  given  death-state’s  bounds  may  be  distributed  among  processors.  No  note  is  made 
of  the  state  when  the  information  from  a  death-state  or  pruned-state  is  incorporated  into 
the  bounds.  Another  consequence  is  that  cycles  in  the  state-space  cannot  be  explicitly 
recognized.  Because  SURE  maintains  all  the  information  it  may  need  about  a  path,  it  can 
recognize  a  cycle  and  accurately  prune  paths  that  endlessly  traverse  the  cycle.  ASSURE 
cannot  recognize  cycles,  because  it  only  accumulates  information  about  a  path,  it  does  not 
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maintain  a  history  of  a  path’s  states.  Our  interest  was  in  seeing  if  ASSIST— >SURE  could  be 
parallelized  and  achieve  high  performance  on  an  interesting  class  of  problems.  We  judged 
that  the  features  we  chose  not  to  support  were  not  essential  for  a  demonstration  of  parallel 
processing’s  viability  for  this  application. 

Parallel  ASSURE  exploits  parallelism  by  implementing  a  parallel  depth-first-search  for 
every  terminal  (i.e.,  death-state  or  pruned)  node.  This  type  of  parallelism  has  already  been 
well  studied  [6, 18, 16];  our  contribution  is  to  demonstrate  that  a  general  tool  for  an  important 
application  class  can  benefit  from  parallel  processing,  without  the  user  having  to  be  involved 
with  the  details  of  the  parallelization. 

5.3  Dynamic  Load  Balancing 

A  driving  concern  behind  ASSURE’s  design  was  the  recognition  that  ASSIST— >SURE  prob¬ 
lems  are  very  dynamic  in  the  demands  they  place  on  a  system,  and  that  dynamic  load¬ 
balancing  would  likely  be  needed  to  achieve  high  parallel  run-time  efficiencies.  We  next 
briefly  describe  the  methods  we  used. 

A  summary  of  dynamic  load  balancing  techniques  for  parallel  searching  is  given  in  [11]. 
Methods  described  there  are  asynchronous,  and  local:  when  a  processor  becomes  idle  it  polls 
a  small  subset  of  other  processors,  asking  for  more  work.  Our  own  view  on  load  balancing 
has  been  more  synchronous  and  global  [13,  12,  14,  15]  all  processors  are  involved  in  every 
balancing  of  the  workload.  Each  style  has  its  advantages  and  disadvantages.  The  advantage 
of  an  asynchronous  scheme  is  that  processors  with  work  to  do  are  not  delayed  by  a  load¬ 
balancing  from  which  they  do  not  benefit.  A  disadvantage  is  that  it  is  possible  for  workload 
to  “pile  up”  in  some  localized  subset  of  processors,  after  which  it  may  take  some  time  for 
repeated  local  load-balancing  requests  to  siphon  the  excess  workload  from  that  region.  A 
disadvantage  of  global  schemes  is  that  the  per-balance  overhead  is  higher;  there  is  also 
some  doubt  about  the  scalability  of  global  methods.  An  advantage  is  that  a  global  scheme 
treats  potential  balancing  problems  as  well  as  existing  ones.  When  a  global  method  evenly 
distributes  all  existing  workload,  processors  that  may  soon  be  empty  are  given  a  transfusion 
of  work  before  it  is  actually  needed.  Therefore,  one  expects  that  a  global  load-balancing 
method  will  be  called  less  often  than  a  local  method.  A  final  advantage  follows  from  the  fact 
that  long  messages  are  preferred  on  distributed  memory  multiprocessors,  as  the  large  fixed 
communication  startup  cost  is  amortized  over  more  bytes.  Global  methods  move  more  data, 
and  so  achieve  a  lower  per-byte  communication  overhead. 

We  explored  the  use  of  global  methods,  both  because  our  experience  has  been  in  using 
such  methods,  and  because  of  the  possibility  that  state-spaces  we  search  may  have  local 
concentrations  of  high  workload  (this  occurs  in  regions  where  the  modeled  system  is  quickly 
able  to  recover  from  failures.)  Global  methods  are  appropriate  for  quickly  breaking  these 
concentrations  up. 

Two  important  synchronous  load-balancing  algorithms  have  been  discussed  in  the  liter¬ 
ature.  The  “dimension-exchange”  algorithm  [5]  assumes  that  every  processor  has  a  number 
of  independent  jobs;  in  d  exchange  steps  it  completely  balances  the  workload,  d  being  the 
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dimension  of  the  hypercube  (i.e.,  the  system  has  2d  processors).  In  each  step  the  algorithm 
balances  the  workload  through  one  hypercube  dimension,  as  follows.  Let  bd-ibd-2  ■■■bo  be 
the  identity  of  a  processor  expressed  in  binary.  In  step  j  this  processor  balances  the  workload 
(i.e.,  number  o*  jobs)  betweer  itself  and  processor  bd- 1  •  •  •  bj  •  •  •  b0:  the  processors  compare 
their  loads,  and  the  one  with  more  jobs  sheds  enough  so  that  each  have  half  of  their  total. 
If  a  processor’s  load  is  infinitely  divisible,  this  algorithm  is  guaranteed  to  assign  exactly  the 
same  amount  of  load  to  each  processor  by  the  end  of  d  steps.  Observe  that  each  step  actually 
requires  2  communications  per  processor — one  to  inform  the  partner  of  its  load,  the  other  to 
send  or  receive  the  load. 

Load  balancing  techniques  based  on  parallel-prefix  style  computations  called  scans  have 
also  been  suggested  [1].  Our  adaptation  of  this  method  involves  two  steps.  First,  every 
processor  P{  submits  the  length  Li  of  its  WorkList  to  an  enumeration  scan  that  returns  to 
Pi  two  values:  Si  =  £}=o  Tjj  anc^  the  sum  W  of  all  such  list  lengths.  As  discussed  in  [1],  one 
can  accomplish  this  in  2d  parallel  coiiimunication  steps.  Given  5,,  processor  P,  knows  that  if 
all  the  path-records  in  the  system  were  enumerated  increasingly  by  processor  identity,  then 
its  path  records  would  be  numbered  Si  through  Si  +  Li~  1.  The  second  step  is  to  distribute 
the  path-records.  Since  each  processor  knows  the  total  number  of  path-records  in  the  system 
(PF),  it  is  simple  to  evenly  remap  the  path-records  as  a  function  of  their  enumeration  indices. 
For  example,  processor  P0  will  get  path-records  0  through  W/2d- 1,  Pi  will  get  records  W/2d 
through  W/ 2d— 1 — 1,  and  so  on.  Once  a  processor  knows  the  indices  of  its  current  path-records 
it  can  easily  compute  the  identity  of  processors  to  whom  those  path-records  should  be  sent. 
However,  a  processor  is  not  able  to  determine  from  whom  it  will  receive  path-records.  If 
a  processor  receives  an  unanticipated  message,  the  operating  system  keeps  the  message  in 
system  buffer  space  until  a  user  process  asks  for  it,  at  which  point  the  system  copies  the 
entire  message  into  a  buffer  indicated  by  the  user  process.  It  is  easy  to  overflow  the  system 
buffer  space  if  no  message  flow  control  is  used.  Our  implementation  deals  with  this  problem 
by  having  each  processor  send  short  “header”  messages  to  all  processors  to  whom  it  will 
send  path-records.  Since  the  header  messages  are  short,  there  is  no  danger  of  overflowing 
the  system  buffers  into  which  they  are  initially  received.  Following  the  transmission  of  all 
such  header  messages  a  processor  engages  in  a  global  synchronization.  The  processors  then 
logically  receive  their  header  messages  following  the  global  synchronization,  being  assured 
that  all  such  messages  are  resident  in  the  processor.  Thus  forewarned,  a  process  can  set 
up  receive  buffers  in  the  user  space  for  the  path-records  to  follow,  and  again  engage  in  a 
global  synchronization.  Following  this  synchronization  the  path-records  are  sent,  received, 
and  placed  in  the  processor’s  WorkLists.  As  described  above  the  scan-based  method  requires 
at  least  4 d  parallel  communication  steps  before  the  path-records  are  exchanged.  It  does  have 
the  potential  advantage  that  a  path-record  is  transmitted  only  once — a  path-record  may  be 
passed  as  many  as  d  times  using  the  dimension  exchange  algorithm. 

As  we  will  see  in  the  next  section,  the  choice  of  load-balancing  mechanism  has  only 
a  second-order  effect  on  performance.  One  method  may  be  significantly  faster  than  the 
other,  and  yet  the  overall  performance  does  not  change  significantly  because  load-balancing 
is  needed  so  infrequently  that  its  cost  is  not  a  major  contributing  factor  to  the  overall 
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performance.  However,  we  will  see  that  performance  is  very  much  affected  by  whether  any 
dynamic  load  balancing  is  employed. 


6  Performance 

We  evaluated  the  performance  of  ASSURE  on  a  suite  of  ASSIST  models  developed  at  NASA 
Langley  to  analyze  fault  tolerant  communication  networks.  For  each  model  we  measure 
the  time  required  to  solve  the  problem  using  ASSIST— >SURE  on  a  Sun  3/150  workstation, 
the  time  required  by  serial  ASSURE  on  the  same  workstation,  the  time  required  by  serial 
ASSURE  on  a  Cray-2,  and  the  time  required  by  ASSURE  on  one  node,  and  on  32  nodes  of 
the  Intel  iPSC/860  multiprocessor.  The  iPSC/860  is  resident  at  the  Institute  for  Computer 
Applications  in  Science  and  Engineering  at  the  NASA  Langley  Research  Center;  the  Cray-2 
is  also  at  NASA  Langley.  We  find  that  the  parallel  version  of  ASSURE  runs  three  orders  of 
magnitude  faster  than  ASSIST— >SURE.  Roughly  speaking,  one  order  of  magnitude  can  be 
attributed  to  the  algorithmic  improvement  of  ASSURE  over  ASSIST— >SURE,  a  second  order 
of  magnitude  can  be  attributed  to  the  architectural  improvement  of  an  i860  microprocessor 
over  the  68020  used  in  the  Sun  3/150,  a  final  order  of  magnitude  can  attributed  to  the 
exploitation  of  parallelism.  We  find  that  on  the  largest  problems  the  parallel  version  of 
ASSURE  achieves  high  processor  utilizations,  and  solves  the  problems  faster,  by  an  order 
of  magnitude,  than  the  serial  supercomputer.  It  should  be  noted  that  as  designed  ASSURE 
is  an  inherently  scalar  code,  and  so  cannot  benefit  from  the  vector  processing  capabilities 
of  the  Cray5;  it  is  not  cost-effective  to  use  a  Cray  as  a  fast  scalar  processor.  The  point  we 
wish  to  make  with  the  comparison  is  that  traditional  supercomputer  architectures  are  not 
optimal  (or  even  reasonable)  for  this  problem  class.  On  the  other  hand,  distributed  memory 
architectures  appear  to  be  ideal  for  these  problems. 

The  networks  in  the  problem  suite  achieve  reliability  through  redundancy  of  network 
links,  and  use  of  dynamic  reconfiguration  when  network  components  fail.  The  network 
fails  whenever  certain  sequences  of  errors  occur  in  a  small  enough  span  of  time  so  that 
reconfiguration  processes  are  defeated.  The  different  ASSIST  models  vary  in  their  level 
of  detail;  the  simpler  ones  aggregate  certain  aspects  of  network  behavior,  while  the  most 
complex  one  treats  it  in  explicit  detail  using  the  extended  ASSIST  syntax  to  implement 
algorithmically  expressed  state-changes.  The  networks  studied  lead  to  state  vectors  with 
thirty  to  one  hundred  components;  the  most  detailed  network  is  illustrated  in  Figure  6.  The 
network  connects  a  fault  tolerant  processor  (FTP)  with  an  array  of  sensors.  The  FTP  has 
four  channels,  accessing  the  network  through  one  of  six  network  interfaces;  each  interface 
is  connected  to  one  of  two  separate  network  partitions.  Only  one  channel  and  one  network 
partition  is  active  at  a  time.  The  network  partitions  are  comprised  of  redundant  routing 
nodes  and  communication  links,  and  serve  to  connect  the  channels  to  clusters  of  replicated 
sensors.  For  each  of  the  reliability  models  pruning  thresholds  were  selected  so  that  the  sum 

5Any  attempt  to  vectorize  SURE  would  have  to  go  to  a  great  deal  of  trouble  to  construct  appropriate 
vectors.  It  is  not  immediately  obvious  if  or  how  this  might  be  done. 
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Figure  6:  Fault-tolerant  network  studied 


of  the  upper  bounds  of  all  pruned  states  is  at  least  an  order  of  magnitude  smaller  than  the 
sum  of  upper  bounds  on  discovered  death-states. 

Table  1  gives  the  number  of  lines  of  ASSIST  grammar  for  each  model,  and  the  number 
of  path-records  analyzed  by  ASSURE.  The  number  of  lines  given  for  Net4  includes  the  total 
number  of  C  lines  involved  in  the  model. 

Table  2  gives  the  measured  performance  of  ASSIST— »SURE  and  serial  ASSURE.  We 
executed  ASSIST— »SURE  on  a  Sun  3/150  workstation;  serial  ASSURE  ran  on  the  3/150, 
one  node  of  an  iPSO/860,  and  a  Cray-2.  Measurements  are  on  the  Sun  and  Cray-2  are 
of  CPU  time;  iPSC/860  measurements  are  of  elapsed  time.  The  elapsed  time  on  the  Cray 
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Number  of  ASSIST  lines 

Number  of  path-records 

193 

843,378 

269 

724,038 

312 

3,528,778 

1737  (including  C  code) 

27,204,876 

Table  1:  Size  of  model  descriptions  and  number  of  path-records  analyzed 


ASSIST-+SURE 

ASSURE:3/150 

ASSURE:i860 

ASSURE:Cray-2 

I  hr.,  48  min. 

1  hr. 

II  hr.,  30  min. 

N/A 

12  min. 

10  min. 

35  min. 

6  hr.,  12  min. 

63  secs. 

56  secs. 

2  min.,  40  sec. 
39  min.,  40  sec. 

76  secs. 

69  secs. 

2  min.,  45  sec. 

54  min. 

ASSURE’s  serial  performance  on  network  problem  suite 


Model 


NLB 

(time,  utilization) 

DE 

(time, utilization) 

SLB 

(time,  utilization) 

(5.5  sec.,  36%) 
(5.2  sec.,  41%) 
(9.7  sec.,  51%) 
(163  sec.,  45%) 

(3.6  sec.,  55%) 
(3.5  sec.,  61%) 
(8.2  sec.,  61%) 
(88  sec.,  85%) 

(4.0  sec.,  49%) 
(3.9  sec.,  55%) 
(9.4  sec.,  53%) 

(93  sec.,  80%) 

ASSURE’s  parallel  performance  on  network  problem  suite 


Table  2:  Serial  and  parallel  performance  measurements 


timings  is  at  least  seven  times  larger,  due  to  time-shared  multiprogramming. 

To  evaluate  the  cost  and  benefits  of  dynamic  remapping  we  ran  each  model  five  times 
using  three  different  load  balancing  policies:  no  load-balancing  (except  for  an  initial  distri¬ 
bution  of  path-records  to  each  processor),  dimension  exchange  (DE),  and  scan-based  load 
balancing  (SLB).  For  both  the  DE  and  SLB  methods  an  empty  processor  broadcasts  a  re¬ 
quest  to  load-balance.  A  processor  looks  for  such  a  message  after  every  500th  path-record  it 
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processes  6.  A  load-balance  occurs  after  at  least  one  processor  is  empty  and  all  processors 
have  recognized  the  request. 

Timings  and  processor  utilizations  obtained  from  these  experiments  are  presented  in 
Table  2.  For  each  set  of  five  runs  the  performance  data  shows  little  variation  (with  the 
exception  of  the  number  of  load  balancings  on  Net4),  so  that  the  averages  we  present  are 
quite  typical.  For  each  model  we  give  the  time  required  to  run  the  problem  on  32  nodes 
of  an  Intel  iPSC/860,  the  average  processor  efficiency  at  run-time  (this  figure  multiplied  by 
32  is  the  speedup),  the  average  number  of  load  balances  required  during  the  solution,  and 
the  average  time  required  to  perform  a  load  balance  once  all  processors  are  engaged  in  the 
balancing. 

One  should  also  consider  the  time  required  to  translate  ASSIST  into  C  and  compile  the 
program.  The  build  time  using  the  earlier  Intel  iPSC/2’s  compiler  is  between  one  and  two 
minutes  on  all  of  our  network  problems.  At  the  time  of  this  writing  the  linking  phase  of 
the  iPSC/860  compiler  (which  runs  on  the  same  80386-based  host  computer  used  for  the 
iPSC/2)  requires  far  more  time  than  is  reasonable.  As  we  expect  this  problem  to  be  fixed  in 
the  near  future  we  omit  exact  timings  of  the  build  time. 

Some  features  of  the  parallel  performance  data  in  Table  2  and  its  comparison  with  the 
serial  performance  data  are  noteworthy.  Most  importantly,  our  approach  gives  the  ASSIST 
modeler  the  ability  to  analyze  models  far  more  complex  than  was  ever  practical  using  only 
ASSIST  and  SURE.  High  run-time  processor  efficiencies  are  achieved  on  the  most  complex 
models.  Our  results  suggest  that  even  higher  utilizations  will  be  achieved  on  larger-scale 
problems.  Indeed,  for  each  of  the  problems  studied  here,  extremely  high  utilizations  are 
achieved  when  we  decrease  the  pruning  threshold  further,  thereby  exposing  more  of  the 
state-space  for  analysis. 

Table  3  reports  the  average  number  of  load  balancings  required  under  the  two  schemes, 
and  the  average  cost  of  performing  a  balance.  The  lower  per-balance  cost  of  the  dimension 
exchange  method  can  be  attributed  to  fewer  communication  startups,  and  a  lower  volume 
of  communication.  There  was  sufficient  variation  in  the  number  of  load  balancings  required 
for  Net4  to  suspert  that  the  averages  given  need  not  be  close  to  the  true  means. 

Despite  the  apparent  superiority  of  the  dimension  exchange  method  over  the  scan-based 
method  on  these  problems,  the  main  point  to  be  learned  from  these  experiments  is  that  the 
method  used  to  balance  load  matters  far  less  than  the  decision  to  support  dynamic  load 
balancing  at  all.  Table  2  clearly  shows  the  performance  degradation  suffered  by  the  no-load- 
balancing  strategy.  Load  balancing  is  called  so  infrequently  on  the  larger  problems  that  its 
total  cost  has  only  a  secondary  impact  on  performance. 


7  Summary 

ASSURE  is  a  tool  for  analyzing  the  reliability  of  embedded  computer  control  systems.  AS¬ 
SURE  combines  the  functions  of  two  prior  tools,  ASSIST  and  SURE.  In  doing  so  it  achieves 

6The  cost  of  probing  for  a  the  existence  of  a  possible  message  is  high  on  the  iPSC/860 
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Model 

Avg.  #  Balances 
(DE) 

Avg.  #  Balances 
(SLB) 

Balance  Time  (DE) 
(milliseconds) 

Balance  Time  (SLB) 
(milliseconds) 

38 

14.8 

1 

45.6 

9.7 

14.3 

75.2 

71.5 

17.1 

25.5 

Net4 

133 

109 

21.8 

35.5 

Table  3:  Load  Balancing  Statistics 


significant  run-time  savings  over  ASSIST  and  SURE.  To  reduce  run-times  even  farther,  AS¬ 
SURE  has  been  parallelized  on  a  32-node  Intel  iPSC/860  distributed  memory  multiprocessor. 
The  parallelization  is  automatic — ASSURE’s  users  need  not  concern  themselves  with  any  de¬ 
tails  of  the  parallelization.  On  a  suite  of  moderately  complex  models  ASSURE  achieved  very 
high  processor  run-time  efficiencies.  On  the  iPSC/860  it  has  solved  in  eight  run-time  sec¬ 
onds  a  model  that  formerly  required  half  a  day  on  a  Sun  3/150  workstation.  Furthermore, 
its  input  model  syntax  has  been  extended  to  permit  the  natural  construction  of  models  that 
formerly  could  not  be  easily  expressed.  ASSURE  performs  especially  well  on  large  models, 
a  processor  efficiency  of  85%  is  achieved  on  the  most  complex  (and  realistic)  model  in  our 
suite.  Furthermore,  ASSURE  executes  an  order  of  magnitude  times  faster  on  these  model 
than  ASSURE’s  serial  implementation  on  a  Cray-2.  Our  development  and  testing  of  this 
tool  convincing  demonstrates  the  real  viability  of  using  parallel  machines  to  solve  complex 
reliability  problems. 
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16.  Abstract 

This  paper  describes  a  parallelized  system,  ASSURE,  for  computing  the  reli¬ 
ability  of  embedded  avionics  flight  control  systems  which  are  able  to  reconfigure 
themselves  in  the  event  of  failure.  ASSURE  accepts  a  grammar  that  describes  a  re¬ 
liability  semi-Markov  state-space.  From  this  it  creates  a  parallel  program  that 
simultaneously  generates  and  analyzes  the  state-space,  placint  upper  and  lower 
bounds  on  the  probability  of  system  failure.  ASSURE  is  implemented  on  a  32-node 
Intel  iPSC/860,  and  has  achieved  high  processor  efficiencies  on  real  problems. 
Through  a  combination  of  improved  algorithms,  exploitation  of  parallelism,  and  use 
of  an  advanced  microprocessor  architecture,  ASSURE  has  reduced  the  execution  time  on 
substantial  problems  by  a  factor  of  one  thousand  over  previous  workstation  imple¬ 
mentations.  Furthermore,  ASSURE's  parallel  execution  rate  on  the  iPSC/860  is  an 
order  of  magnitude  faster  than  its  serial  execution  rate  on  a  Cray-2  supercomputer. 
While  dynamic  load  balancing  is  necessary  for  ASSURE’s  good  performance,  it  is  need¬ 
ed  only  infrequently;  the  particular  method  of  load  balancing  used  does  not  sub¬ 
stantially  affect  performance. 
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